A combination of DHTs and Peer Clustering for Distributed Information Retrieval

Authors

Odysseas Papapetrou

Wolf Siberski

Wolf-Tilo Balke

Wolfgang Nejdl

Abstract

Distributed Hash Tables (DHTs) are very efficient for querying based on key lookups, provided that each individual peer has to register only a small number of keys. However, building huge term indexes, as required for IR-style keyword search, is impractical with plain DHTs. Because document term vocabularies are large, joining peers cause huge numbers of key inserts and, subsequently, large numbers of index maintenance messages. Thus, the key to exploiting DHTs for distributed information retrieval is to reduce index maintenance. We show that this can be achieved by combining DHTs with peer clustering. Peers are first clustered into communities, each community having a representative super-peer. Then all occurrences of a term in a community are published to the global DHT in a batch by the representative super-peer. Our evaluation shows that this reduces index maintenance cost by an order of magnitude, while still keeping a complete and correct term index for query processing.
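
The batching idea in the abstract can be illustrated with a toy message count. This is only a minimal sketch under stated assumptions, not the authors' implementation: the DHT is modeled as an in-memory dictionary, one insert is counted as one maintenance message, and the function names (`publish_plain`, `publish_clustered`), peer names, and sample terms are all hypothetical.

```python
from collections import defaultdict


def publish_plain(peer_terms):
    """Plain DHT (assumed model): every peer inserts each of its terms itself."""
    index = defaultdict(set)  # term -> peers holding that term
    messages = 0
    for peer, terms in peer_terms.items():
        for term in terms:
            index[term].add(peer)
            messages += 1  # one insert per (peer, term) pair
    return index, messages


def publish_clustered(peer_terms, communities):
    """Clustered variant (assumed model): the super-peer of each community
    publishes one batched insert per distinct term in that community."""
    index = defaultdict(set)
    messages = 0
    for members in communities.values():
        batch = defaultdict(set)  # term -> community members holding it
        for peer in members:
            for term in peer_terms[peer]:
                batch[term].add(peer)
        for term, holders in batch.items():
            index[term] |= holders
            messages += 1  # one insert per (community, term) pair
    return index, messages


# Hypothetical toy data: four peers grouped into two communities.
peer_terms = {
    "p1": {"dht", "index", "peer"},
    "p2": {"dht", "index", "query"},
    "p3": {"cluster", "peer"},
    "p4": {"cluster", "peer", "query"},
}
communities = {"c1": ["p1", "p2"], "c2": ["p3", "p4"]}

plain_index, plain_msgs = publish_plain(peer_terms)
clustered_index, clustered_msgs = publish_clustered(peer_terms, communities)

print(plain_msgs, clustered_msgs)      # 11 7
print(plain_index == clustered_index)  # True
```

In this toy model both variants end up with the same term index, while the clustered variant sends fewer inserts because terms shared within a community are published once. The size of the saving depends on how much vocabulary community members share, which is why clustering similar peers matters.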

Related resources

Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing

Advanced applications for Distributed Hash Tables (DHTs), such as Peer-to-Peer Information Retrieval, require a DHT to quickly and efficiently process a large number (in the order of millions) of requests. In this paper we study mechanisms to optimize the throughput of DHTs. Our goal is to maximize the number of route operations per peer per second a DHT can perform (given certain constraints o...

Aggregation of a Term Vocabulary for Peer-to-Peer Information Retrieval: a DHT Stress Test

There has been an increasing research interest in developing full-text retrieval based on peer-to-peer (P2P) technology. So far, these research efforts have largely concentrated on efficiently distributing an index. However, ranking of the results retrieved from the index is a crucial part in information retrieval. To determine the relevance of a document to a query, ranking algorithms use coll...

A Tabu-Based Cache to Improve Range Queries on Prefix Trees

Distributed Hash Tables (DHTs) provide the substrate to build large scale distributed applications over Peer-to-Peer networks. A major limitation of DHTs is that they only support exact-match queries. In order to offer range queries over a DHT it is necessary to build additional indexing structures. Prefix-based indexes, such as Prefix Hash Tree (PHT), are interesting approaches for building dis...

PCIR: Combining DHTs and peer clusters for efficient full-text P2P indexing

Distributed hash tables (DHTs) are very efficient for querying based on key lookups. However, building huge term indexes, as required for IR-style keyword search, poses a scalability challenge for plain DHTs. Due to the large sizes of document term vocabularies, peers joining the network cause huge amounts of key inserts and, consequently, a large number of index maintenance messages. Thus, the...
